1. Survey Design and Statistical Inference
UK Data Service
October 2025
11.00-11.15 Introductions
11.15-12.00 Session 1: Survey Design: a refresher
Coffee break
12.15-13.00 Session 2: Inference in theory and practice
Lunch break
13.45-14.30 Session 3: R and Stata examples (1)
Coffee break
14.45-15.30 Session 4: R and Stata examples (2)
Main repository of UK secondary social science data
A provider of support, training and guidance
Freely accessible, funded by the ESRC
Who are we for?
Survey microdata:
Aggregate databases: e.g. OECD
Census data – modern and historic records
Business and administrative microdata
One-off surveys, multimedia and qualitative data deposited on Reshare
tl;dr: some basic degree of familiarity with the topic is assumed
[Figure: stylised image of the circular relationship between inference and sampling]
Survey design: strategies used to collect samples.
Sample members can either be selected:
The process of deriving population estimates from a sample is called inference
Random sampling minimises the risk of obtaining unrepresentative samples and biased population estimates:
Simple random sampling (SRS), i.e. directly drawing members of the target population, is considered the best way to conduct random sampling and avoid bias…
… but it is difficult to achieve: it requires a list of the whole population (e.g. a national register).
… and it is not necessarily optimal:
Designing surveys entails striking a balance between:
Large-scale social surveys tend to rely on techniques that strike this balance better than SRS.
In effect, SRS is more of a theoretical possibility than a real-world sampling technique for social surveys.
Clustering
Stratifying also consists in dividing the population into groups according to predetermined characteristics, but this time units are drawn from all of them
Adjusting the sampling proportion, i.e. sampling some groups at a higher or lower rate than the rest of the population
No sampling frame, i.e. no national register of the population
The closest to it (in Great Britain) is Royal Mail’s Postcode Address File (PAF), i.e. a structured list of addresses.
For Northern Ireland, the most commonly used equivalent is that of the Land and Property Services Agency (LPSA).
We cannot use the PAF to directly draw samples of households or individuals, as their number at each address is not known.
However, the structure of the PAF easily enables geographical clustering of surveys. Addresses, or ‘delivery points’, cluster into larger units.
Also, addresses receiving unusually large amounts of mail, which are likely to be businesses or institutions, can be filtered out.
The previous figure illustrates clustering with four districts:
The higher level clusters, i.e. those at which the first random draw happened, are the Primary Sampling Units (PSUs).
Districts 1 and 4 have been selected to be in the sample.
A second stage of sampling follows: addresses are sampled from within the two selected districts
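The two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame of four districts with 50 addresses each, the function name `two_stage_sample` and the sample sizes are all hypothetical, and real surveys typically combine this with stratification and unequal selection probabilities.

```python
import random

def two_stage_sample(frame, n_psus, n_addresses, seed=0):
    """Draw PSUs at random, then addresses at random within each drawn PSU.

    frame: dict mapping a PSU name to the list of addresses it contains.
    """
    rng = random.Random(seed)
    # Stage 1: random draw of the Primary Sampling Units (districts)
    selected_psus = rng.sample(sorted(frame), n_psus)
    # Stage 2: random draw of addresses within each selected PSU only
    return {psu: rng.sample(frame[psu], n_addresses) for psu in selected_psus}

# Hypothetical frame: four districts, 50 addresses each
frame = {
    f"District {d}": [f"D{d}-address-{i}" for i in range(1, 51)]
    for d in range(1, 5)
}

sample = two_stage_sample(frame, n_psus=2, n_addresses=5)
for psu, addresses in sample.items():
    print(psu, addresses)
```

Addresses in districts that were not drawn at the first stage have no chance of entering the sample, which is what makes fieldwork cheaper and the sample coarser.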
Subsequent drawing of either:
In large-scale surveys the PSUs are often geographical areas.
Clustering within households arises in some large-scale household surveys such as the Labour Force Survey.
Imagine:
Those born abroad are more likely to live together, ‘clustered’ within households, than to be spread randomly across them.
Some households are wholly overseas born, some mixed and most wholly UK born.
e.g.
Household 1: 1 UK born individual
Household 2: 3 UK born
Household 3: 2 Overseas born
Household 4: 6 UK born
Household 5: 1 Overseas born, 1 UK born
Household 6: 2 UK born
Household 7: 1 UK born
Household 8: 1 UK born
Household 9: 5 Overseas born
Household 10: 3 UK born
And so on…
Clustering within households means that, if we draw one in ten of the households for our sample, we might expect the sample to be less accurate in predicting the proportion of our population who were born outside the UK than if we had sampled individuals at random.
Clustering comes at the cost of making the sample coarser: it shrinks the population from which the sample is drawn, reducing its diversity, which in turn makes the estimates drawn from it less precise.
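The loss of precision can be illustrated with a small simulation based on the ten example households. This is a rough sketch under a strong assumption: the population is taken to be the ten-household pattern repeated 200 times, which is purely illustrative. It compares the spread of the estimated overseas-born share when drawing one in ten households with the spread when drawing one in ten individuals at random.

```python
import random
import statistics

# (overseas-born, UK-born) counts for the ten example households
pattern = [(0, 1), (0, 3), (2, 0), (0, 6), (1, 1),
           (0, 2), (0, 1), (0, 1), (5, 0), (0, 3)]

# Assumption: the population is this pattern repeated 200 times
households = pattern * 200
individuals = ([1] * sum(o for o, _ in households)
               + [0] * sum(u for _, u in households))
true_share = sum(individuals) / len(individuals)

rng = random.Random(42)
n_households = len(households) // 10    # one in ten households
n_individuals = len(individuals) // 10  # one in ten individuals

def household_estimate():
    """Overseas-born share among everyone in a random draw of households."""
    drawn = rng.sample(households, n_households)
    overseas = sum(o for o, _ in drawn)
    people = sum(o + u for o, u in drawn)
    return overseas / people

def individual_estimate():
    """Overseas-born share in a simple random sample of individuals."""
    drawn = rng.sample(individuals, n_individuals)
    return sum(drawn) / len(drawn)

reps = 500
sd_households = statistics.stdev([household_estimate() for _ in range(reps)])
sd_individuals = statistics.stdev([individual_estimate() for _ in range(reps)])

print(f"True overseas-born share:            {true_share:.3f}")
print(f"Spread (SD) when sampling households:  {sd_households:.4f}")
print(f"Spread (SD) when sampling individuals: {sd_individuals:.4f}")
```

Both designs interview a similar number of people, but the estimates from the household sample vary more from one draw to the next, because members of the same household tend to share the same country of birth.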
UK surveys are usually stratified:
Such information is usually obtained from (area-level) Census data.
In simple random sampling, each element drawn from the sampling frame has an equal selection probability.
In stratified sampling, proportionate stratification is when the same sampling fraction is used across all strata:
Disproportionate stratification means that the sampling fraction varies across strata.
Disproportionate sampling results in some groups being over-represented in the sample:
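A short numerical sketch may help to contrast the two approaches. All figures are hypothetical: two strata of 8,000 and 2,000 units, and a total sample of 100 allocated either proportionately or equally between them.

```python
# Hypothetical strata; all population and sample sizes are illustrative only
population = {"Stratum A": 8000, "Stratum B": 2000}
designs = {
    "proportionate": {"Stratum A": 80, "Stratum B": 20},
    "disproportionate": {"Stratum A": 50, "Stratum B": 50},
}

def summarise(sample_sizes):
    """Return sampling fraction, sample share and population share per stratum."""
    total_pop = sum(population.values())
    total_n = sum(sample_sizes.values())
    return {
        stratum: {
            "fraction": n / population[stratum],
            "sample_share": n / total_n,
            "population_share": population[stratum] / total_pop,
        }
        for stratum, n in sample_sizes.items()
    }

results = {name: summarise(sizes) for name, sizes in designs.items()}
for name, summary in results.items():
    print(name)
    for stratum, s in summary.items():
        print(f"  {stratum}: fraction={s['fraction']:.5f}, "
              f"sample share={s['sample_share']:.0%}, "
              f"population share={s['population_share']:.0%}")
```

Under the proportionate design both strata share the same sampling fraction (1 in 100), so each stratum's share of the sample matches its share of the population. Under the disproportionate design, Stratum B makes up half of the sample but only a fifth of the population, i.e. it is over-represented.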